15 research outputs found

    CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

    The CoNLL-03 corpus is arguably the most well-known and utilized benchmark dataset for named entity recognition (NER). However, prior works found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation both for better explainability of NER labels and as an additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on our data, but crucially that the share of correct predictions falsely counted as errors due to annotation noise drops from 47% to 6%. This indicates that our resource is well suited to analyze the remaining errors made by state-of-the-art models, and that the theoretical upper bound even on high-resource, coarse-grained NER is not yet reached. To facilitate such analysis, we make CleanCoNLL publicly available to the research community. Comment: EMNLP 2023 camera-ready version

    Automatic preservation watch using information extraction on the Web: a case study on semantic extraction of natural language for digital preservation

    The ability to recognize when digital content is becoming endangered is essential for maintaining the long-term, continuous and authentic access to digital assets. To achieve this ability, knowledge about aspects of the world that might hinder the preservation of content is needed. However, the processes of gathering, managing and reasoning on knowledge can become infeasible to perform manually when the volume and heterogeneity of content increases, multiplying the aspects to monitor. Automation of these processes is possible [11,21], but its usefulness is limited by the data it is able to gather. Up to now, automatic digital preservation processes have been restricted to knowledge expressed in a machine-understandable language, ignoring a plethora of data expressed in natural language, such as the DPC Technology Watch Reports, which could greatly contribute to the completeness and freshness of data about aspects of the world related to digital preservation. This paper presents a real case scenario from the National Library of the Netherlands, where the monitoring of publishers and journals is needed. This knowledge is mostly represented in natural language on Web sites of the publishers and, therefore, is difficult to automatically monitor. In this paper, we demonstrate how we use information extraction technologies to find and extract machine-readable information on publishers and journals for ingestion into automatic digital preservation watch tools. We show that the results of automatic semantic extraction are a good complement to existing knowledge bases on publishers [9, 20], finding newer and more complete data. We demonstrate the viability of the approach as an alternative or auxiliary method for automatically gathering information on preservation risks in digital content. KEEP SOLUTION

    Exploratory Relation Extraction in Large Multilingual Data

    The task of Relation Extraction (RE) is concerned with creating extractors that automatically find structured, relational information in unstructured data such as natural language text. Motivated by an explosion of sources of readily available text data such as the Web, RE offers intriguing possibilities for querying, organizing, and analyzing information by drawing upon the clean semantics of structured databases and the abundance of unstructured data. However, practical applications of RE are often characterized by vague and shifting information needs on the one hand and large multilingual datasets of unknown content on the other. Classical RE approaches are unable to handle such scenarios since they require a careful, upfront definition of extraction tasks before extractors can be created in an effort-intensive, time-consuming process. With this thesis, I propose the paradigm of Exploratory Relation Extraction (ERE), a user-driven but data-guided process of exploration for relations of interest in unknown data. I show how distributional evidence and an informed linguistic abstraction can be employed to allow users to openly explore a dataset for relations of interest and rapidly prototype extractors for discovered relations at minimal effort. Furthermore, I propose the use of a language-neutral representation of shallow semantics to address the issue of multilingual data. This representation enables a shared feature space for different languages against which extractors can be developed. I present a method that expands English-language Semantic Role Labeling (SRL) to other languages and use it to generate multilingual SRL resources for seven distinct languages from different language groups, namely Arabic, Chinese, French, German, Hindi, Russian and Spanish, in order to bootstrap semantic parsers for these languages. Together, the researched approaches represent a novel way for data scientists to work with large multilingual datasets of unknown content.

    Explorative Relationsextraktion in mehrsprachigen Massendaten

    The task of Relation Extraction (RE) is concerned with creating extractors that automatically find structured, relational information in unstructured data such as natural language text. Motivated by an explosion of sources of readily available text data such as the Web, RE offers intriguing possibilities for querying, organizing, and analyzing information by drawing upon the clean semantics of structured databases and the abundance of unstructured data. However, practical applications of RE are often characterized by vague and shifting information needs on the one hand and large multilingual datasets of unknown content on the other. Classical RE approaches are unable to handle such scenarios since they require a careful, upfront definition of extraction tasks before extractors can be created in an effort-intensive, time-consuming process. With this thesis, I propose the paradigm of Exploratory Relation Extraction (ERE), a user-driven but data-guided process of exploration for relations of interest in unknown data. I show how distributional evidence and an informed linguistic abstraction can be employed to allow users to openly explore a dataset for relations of interest and rapidly prototype extractors for discovered relations at minimal effort. Furthermore, I propose the use of a language-neutral representation of shallow semantics to address the issue of multilingual data. This representation enables a shared feature space for different languages against which extractors can be developed. I present a method that expands English-language Semantic Role Labeling (SRL) to other languages and use it to generate multilingual SRL resources for seven distinct languages from different language groups, namely Arabic, Chinese, French, German, Hindi, Russian and Spanish in order to bootstrap semantic parsers for these languages. 
Together, the researched approaches represent a novel way for data scientists to work with large multilingual datasets of unknown content. The problem of Relation Extraction (RE) is the automatic acquisition of structured, relational information from unstructured data such as natural language text. RE enables new ways of structuring, organizing, and analyzing information, since it can build a bridge between the clearly structured semantics of databases and the continuing explosion of available text data. In practice, however, applying RE is problematic: application scenarios are often characterized by vague, rapidly changing information needs and by large, multilingual datasets of unknown content. Classical RE approaches fail in such scenarios because extraction tasks must be carefully defined in advance, after which extractors are built in a second step at great effort. In this dissertation, I present the novel paradigm of Exploratory Relation Extraction (ERE), a user-driven, semi-automatic process with which new relation types can be discovered in datasets of unknown content. I show how distributional-semantic statistics and a selected linguistic abstraction are applied to enable users both to explore text data for relational information and to rapidly prototype extractors with minimal effort. For handling multilingual data, I further propose the use of a cross-lingual representation of shallow semantics. On this basis, cross-lingual extractors can be created without additional effort. 
I present a method with which English-language semantic roles can be extended to other languages, and use it to create comprehensive resources for training multilingual semantic parsers. Taken together, the methods researched in this dissertation represent a novel approach to working with large, multilingual datasets of unknown content.

    The weltmodell: A data-driven commonsense knowledge base

    We present the WELTMODELL, a commonsense knowledge base that was automatically generated from aggregated dependency parse fragments gathered from over 3.5 million English language books. We leverage the magnitude and diversity of this dataset to arrive at close to ten million distinct N-ary commonsense facts using techniques from open-domain Information Extraction (IE). Furthermore, we compute a range of measures of association and distributional similarity on this data. We present the results of our efforts using a browsable web demonstrator and publicly release all generated data for use and discussion by the research community. In this paper, we give an overview of our knowledge acquisition method and representation model, and present our web demonstrator.

    Unsupervised Discovery of Relations and Discriminative Extraction Patterns

    Unsupervised Relation Extraction (URE) is the task of extracting relations of a priori unknown semantic types using clustering methods on a vector space model of entity pairs and patterns. In this paper, we show that an informed feature generation technique based on dependency trees significantly improves clustering quality, as measured by the F-score, and therefore the ability of the URE method to discover relations in text. Furthermore, we extend URE to produce a set of weighted patterns for each identified relation that can be used by an information extraction system to find further instances of this relation. Each pattern is assigned to one or multiple relations with different confidence strengths, indicating how reliably a pattern evokes a relation, using the theory of Discriminative Category Matching. We evaluate our findings in two tasks against strong baselines and show significant improvements both in relation discovery and information extraction.